fix: clean up Temporal server-side versioning data on TWD deletion #240
anujagrawal380 wants to merge 6 commits into temporalio:main
Conversation
PTAL @carlydf
jaypipes
left a comment
@anujagrawal380 awesome contribution, thank you so much for this PR! A couple of really minor comments below, but overall excellent work.
Thanks, resolved both comments!
jaypipes
left a comment
rock on :) nice work on this @anujagrawal380!
carlydf
left a comment
we need integration tests for this before merging to main / including it in a release
@carlydf Added the integration tests. PTAL

Hi @anujagrawal380, could you fix the linters? Would love to include this in our next release.
…blocking unversioned workers Signed-off-by: Anuj Agrawal <anujagrawal380@gmail.com>
force-pushed from 22607bf to 5a2e844
…ion finalizer
- Add 5-minute deletionCleanupTimeout to prevent TWD stuck in Terminating state indefinitely if Temporal server is unavailable
- Return errors from version/deployment deletion to trigger requeue until versions actually clear (pollers disappear as pods terminate)
- Add update/patch verbs and finalizers RBAC marker for TemporalConnections
- Fix comment-spacing lint on new kubebuilder:rbac markers
force-pushed from d6a305c to 9fd0c74
@carlydf @jaypipes Added a few more minor improvements here: 9fd0c74. PTAL!
```go
// deletionCleanupTimeout is the maximum duration to retry Temporal server-side
// cleanup before giving up and allowing the K8s resource to be deleted.
// This prevents the TWD from being stuck in Terminating state indefinitely
// if the Temporal server is unavailable or a version has persistent active pollers.
deletionCleanupTimeout = 5 * time.Minute
```
I think the goal of this finalizer is for the Kubernetes / TemporalWorkerDeployment perspective and the Temporal server perspective to be aligned. So if the server-side object was created by creating the k8s-side object, the server-side object should also be deleted in the same way.
This timeout breaks that expectation. If the server is temporarily unavailable and this finalizer gives up and deletes the k8s-side object, the server-side object would never be deleted.
I'm curious if you ran into this while testing? The default TTL for "active pollers" in the server is 5 minutes, so if the TWD is running active pollers when it enters Terminating state, the controller needs to kill those Deployments and then wait 5 minutes before the "no active pollers" check passes. If all versions were Drained and the pods had been scaled down for a while, this delay wouldn't exist.
Because of that 5 minute poller TTL, a 5 minute deletionCleanupTimeout would frequently be used in case of deletion before natural scaledown, which would IMO not be good (because of the leftover server-side object thing I explained above).
If we decide to keep this, I would advocate for:
- A very long threshold, like 1h
- Only timing out on unavailable errors from the server, not precondition failed (which is the "active pollers" thing)
I will defer to @jaypipes opinion here though!
- Adds a finalizer to `TemporalWorkerDeployment` to run Temporal server-side cleanup before K8s deletion
- Adds a finalizer to `TemporalConnection` to prevent it from being deleted while any TWD still references it

Problem
When a `TemporalWorkerDeployment` CRD is deleted (e.g., switching back to plain Deployments), the Temporal server retains the build ID routing configuration. The matching service continues routing new tasks to the deleted build ID's physical queue, while unversioned workers poll a different physical queue. Tasks sit in `Scheduled` state indefinitely with no errors.

A secondary race condition exists: Helm deletes both the `TemporalConnection` and the TWD in the same upgrade. Without the connection, the controller cannot talk to Temporal to clean up. This is solved by adding a finalizer to the `TemporalConnection` that blocks its deletion until all referencing TWDs are gone.

Changes
`internal/controller/worker_controller.go`:

TWD finalizer (`temporal.io/worker-deployment-cleanup`):
- adds `handleDeletion()`, which:
  - resets the routing config to unversioned (`BuildID: ""`) -- the critical step that unblocks task dispatch
  - deletes versions with `SkipDrainage: true`
TemporalConnection finalizer (`temporal.io/connection-in-use`):
- added to the `TemporalConnection` during normal TWD reconciliation via `ensureConnectionFinalizer()`
- removed via `removeConnectionFinalizerIfUnused()` during TWD deletion, after checking that no other TWDs in the same namespace reference the connection
RBAC updates:
- `update;patch` verbs for `temporalconnections` (was `get;list;watch`)
- `update` verb for `temporalconnections/finalizers`

Deletion flow
Issue #55
Closes #166